The dataset I have selected, “Using RNA sequencing to examine age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy, and age independent disuse-induced insulin resistance” (GSE113165), records the transcriptome of vastus lateralis cells of YOUNG (N=9, 18-28 y) and OLD (N=18, 60-79 y) men and women before five days of bed rest (control) and after five days of bed rest (test condition). The subjects are also classified into susceptability (i.e. low, high) to disuse-induced insulin resistance. The study aims to understand gene expression associated with bed rest to offset resulting muscle loss.
In the previous assessment in this course BCB420, Assignment 1, the dataset was processed - cleaned for low counts, normalized using TMM (Trimmed Mean of M-Values) and mapped to HUGO identifiers. Duplicate HUGO ids were averaged and genes with empty/unavailable HUGO ids were removed. The initial downloaded GEO dataset contained 58051 gene ids, however the final processed dataset, contained 14969 gene ids.
Publication Title: Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy.
Publication Date: 2019-04-01
Publication Journal: J Appl Physiol
GEO ID: GSE113165
# check to ensure all needed packages are installed
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
library(GEOquery)
library(biomaRt)
library(edgeR)
library(knitr)
We load our information about our datased using the GEO id, we then extract our desired information to produce our sample set with appropriate labels.
GSE113165 <- GEOquery::getGEO("GSE113165", GSEMatrix =TRUE)
show(GSE113165)
GSE113165_data <- (pData(phenoData(GSE113165[[1]])))
#Create sample set
samples <- data.frame(GSE113165_data$description.1, GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
colnames(samples) <- c("id", "subject","age", "sex", "susceptibility", "time")
#Create sample set
samples <- data.frame(GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
samples <- t(samples)
colnames(samples) <- GSE113165_data$description.1
rownames(samples) <- c("subject","age", "sex", "susceptibility", "time")
Expression count data is downloaded and accessed here. Formatted as needed.
#Reading in supplementary files with expression counts
sfiles <- getGEOSuppFiles('GSE113165')
fnames <- rownames(sfiles)
expCounts <- read.delim(fnames[1],header=TRUE, check.names = FALSE)
#Preview of data
kable(head(expCounts))
| geneid | 10876X1 | 10876X2 | 10876X3 | 10876X4 | 10876X5 | 10876X6 | 10876X7 | 10876X8 | 10876X9 | 14406X1 | 14406X2 | 14406X3 | 14406X4 | 14406X5 | 14406X6 | 14406X7 | 14406X8 | 14406X9 | 10876X10 | 10876X11 | 10876X12 | 10876X13 | 10876X14 | 10876X15 | 10876X16 | 10876X17 | 10876X18 | 10876X19 | 10876X20 | 10876X21 | 10876X22 | 10876X23 | 10876X24 | 10876X25 | 10876X26 | 10876X27 | 10876X28 | 10876X29 | 10876X30 | 10876X31 | 10876X32 | 10876X33 | 10876X34 | 10876X35 | 10876X36 | 14406X10 | 14406X11 | 14406X12 | 14406X13 | 14406X14 | 14406X15 | 14406X16 | 14406X17 | 14406X18 | 14406X19 | 14406X20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 49 | 33 | 51 | 25 | 54 | 42 | 30 | 28 | 61 | 67 | 51 | 63 | 49 | 59 | 72 | 42 | 51 | 87 | 33 | 17 | 14 | 34 | 38 | 65 | 67 | 111 | 93 | 32 | 41 | 46 | 39 | 41 | 30 | 33 | 40 | 40 | 39 | 35 | 51 | 34 | 47 | 37 | 25 | 28 | 62 | 50 | 24 | 57 | 72 | 78 | 54 | 52 | 128 | 75 | 105 | 99 |
| ENSG00000000005 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 1 | 1 | 2 | 10 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 2 | 4 | 2 | 2 | 2 | 1 | 3 | 1 | 4 | 1 | 2 | 1 | 4 | 3 | 4 | 1 | 0 | 3 | 5 | 1 | 3 | 2 | 2 | 3 | 7 | 2 | 0 | 2 | 1 | 3 | 1 | 0 | 2 |
| ENSG00000000419 | 207 | 224 | 236 | 102 | 298 | 167 | 115 | 109 | 268 | 352 | 298 | 359 | 288 | 278 | 624 | 279 | 285 | 404 | 181 | 143 | 63 | 185 | 189 | 299 | 183 | 449 | 396 | 90 | 110 | 212 | 237 | 202 | 167 | 175 | 239 | 194 | 223 | 217 | 217 | 235 | 269 | 244 | 149 | 277 | 288 | 267 | 195 | 192 | 408 | 345 | 255 | 229 | 404 | 271 | 625 | 401 |
| ENSG00000000457 | 136 | 112 | 118 | 94 | 194 | 169 | 106 | 96 | 169 | 176 | 220 | 196 | 222 | 243 | 297 | 216 | 212 | 265 | 141 | 115 | 103 | 108 | 134 | 174 | 165 | 265 | 276 | 95 | 146 | 164 | 146 | 109 | 145 | 128 | 144 | 181 | 125 | 136 | 141 | 142 | 178 | 193 | 140 | 200 | 271 | 218 | 166 | 211 | 221 | 227 | 187 | 199 | 384 | 319 | 419 | 345 |
| ENSG00000000460 | 23 | 19 | 26 | 21 | 39 | 32 | 20 | 18 | 45 | 69 | 68 | 48 | 59 | 46 | 65 | 88 | 98 | 82 | 30 | 12 | 24 | 18 | 16 | 37 | 32 | 73 | 76 | 30 | 34 | 44 | 28 | 37 | 36 | 49 | 53 | 44 | 35 | 38 | 36 | 44 | 48 | 51 | 38 | 52 | 66 | 66 | 39 | 65 | 75 | 65 | 56 | 70 | 165 | 88 | 120 | 110 |
| ENSG00000000938 | 27 | 7 | 7 | 5 | 29 | 8 | 9 | 10 | 7 | 15 | 24 | 17 | 14 | 43 | 36 | 27 | 12 | 25 | 25 | 76 | 33 | 7 | 6 | 7 | 15 | 26 | 55 | 15 | 14 | 7 | 14 | 13 | 9 | 9 | 12 | 12 | 32 | 12 | 23 | 6 | 9 | 35 | 11 | 766 | 77 | 16 | 35 | 69 | 20 | 16 | 18 | 24 | 53 | 31 | 42 | 50 |
Here we remove genes with low expression count to clean up the dataset.
dim(expCounts)
## [1] 58051 57
#Removing low counts
cpms <- cpm(expCounts[2:57])
rownames(cpms) <- expCounts[,1]
keep <- rowSums(cpms >1) >=3
expCounts_filtered <- expCounts[keep,]
dim(expCounts_filtered)
## [1] 16674 57
#Original Box Plot
data2plot <- log2(cpm(expCounts_filtered[2:57]))
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "GSE113165 Samples")
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
We will apply TMM (Trimmed Mean of M-Values) to normalize the filtered data. As you can see, the plot (and of course the data) adjusts so all the samples have a similar mean.
#Normalization
filtered_data_matrix <- as.matrix(expCounts_filtered[2:57])
rownames(filtered_data_matrix) <- expCounts_filtered$geneid
d = DGEList(counts=filtered_data_matrix, group=samples["time",])
d = calcNormFactors(d)
normalized_counts <- cpm(d)
#Normalized Box Plot
data2plot <- log2(normalized_counts)
boxplot(data2plot, xlab = "Samples", ylab = "log2 CPM",
las = 2, cex = 0.5, cex.lab = 0.5,
cex.axis = 0.5, main = "Normalized")
abline(h = median(apply(data2plot, 2, median)), col = "green", lwd = 0.6, lty = "dashed")
We can also view the data in terms of a density plot, before and after normalization. We can see that the distribution (sample lines) are tighter and more similar.
#Density Plot
counts_density <- apply(log2(filtered_data_matrix), 2, density)
#Calculate the Limits
xlim <- 0; ylim <- 0
for (i in 1:length(counts_density)) {
xlim <- range(c(xlim, counts_density[[i]]$x));
ylim <- range(c(ylim, counts_density[[i]]$y))
}
cols <- rainbow(length(counts_density))
ltys <- rep(1, length(counts_density))
#plot the first density plot to initialize the plot
par(fig=c(0,0.5,0,1), new=FALSE)
plot(counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="Initial", cex.lab = 0.85)
#plot each line
for (i in 1:length(counts_density)) lines(counts_density[[i]], col=cols[i], lty=ltys[i])
normalized_counts_density <- apply(log2(normalized_counts), 2, density)
#Calculate the Mormalized Limits
xlim <- 0; ylim <- 0
for (i in 1:length(normalized_counts_density)) {
xlim <- range(c(xlim, normalized_counts_density[[i]]$x));
ylim <- range(c(ylim, normalized_counts_density[[i]]$y))
}
cols <- rainbow(length(normalized_counts_density))
ltys <- rep(1, length(normalized_counts_density))
#Normalized Density Plot
par(fig=c(0.50,1,0,1), new=TRUE)
plot(normalized_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="Normalized", cex.lab = 0.85)
for (i in 1:length(normalized_counts_density)) lines(normalized_counts_density[[i]], col=cols[i], lty=ltys[i])
I didn’t use labels in this case as it crowded the MDS visual.
In this case the samples are grouped in pre and post bed rest. There seems to some clustering between -0.5 and -1.0 of the x-axis, as well as 1.0 and 1.5, but it is difficult to see if one group is more prominent than another in the grouping.
plotMDS(d, labels="o",
col = c("darkgreen","blue")[factor(samples["time",])])
Let’s try to visualize in terms of low and high susceptibility to disuse-induced insulin resistance. It looks a bit clearer, and we can see there is a definite difference in groupings of the clusters. Meaning that those with similar susceptibility have similar expression.
plotMDS(d, labels="o",
col = c("darkgreen","blue")[factor(samples["susceptibility",])])
Perhaps age is also a factor, let’s visualize this. This group clusering is even more distinguishable here, showing the shared expression profiles of the age groups.
plotMDS(d, labels="o",
col = c("darkgreen","blue")[factor(samples["age",])])
We now map the ENSEMBL gene ids to HUGO (hgnc) identifiers using biomart, particularly the getGM function.
#Adding another column to normalized_counts for ease of merging later
ensembl_labels <- data.frame(rownames(normalized_counts))
colnames(ensembl_labels) <- "ensembl_gene_id"
normalized_counts <- cbind(normalized_counts, ensembl_labels)
#Loading emsembl human data
mart <- useEnsembl(biomart = "ensembl",
dataset = "hsapiens_gene_ensembl",
mirror = "useast")
#Retreive HUGO symbols for each ENSEMBL ID and set as new rownames
sample_HUGO <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = c(expCounts_filtered[1]), mart = mart)
Some of the gene ids are not returned by getGM (depreciated or no symbol available) so these are not included in the dataset after merging with the returned ids.
#Integrating into counts
normalized_counts_annot <- merge(sample_HUGO,normalized_counts, by.x="ensembl_gene_id", by.y="ensembl_gene_id", all.y=FALSE)
In addition, some of these returned gene ids have been returned with an empty string for the HUGO identifier, and these are also removed from the dataset.
#Empty HUGO Removal
nonempty_HUGO <- normalized_counts_annot["" != normalized_counts_annot[, 2],]
I also realized that there may be duplicate HUGO ids, so we remove the replicates and only keep a mean of the expression values. This means that as a result there is no ENSEMBL id associated with these expression values, this may cause an issue later, but we can always adjust this to map to one of the duplicate ids.
#Unique HUGO symbols only
HUGO_duplicate_name <- nonempty_HUGO[duplicated(nonempty_HUGO[,2]),2]
for (x in HUGO_duplicate_name){
dup_index <- which(x == nonempty_HUGO[,2])
#Store ENSEMBL ID
dup_ENSEMBL <- nonempty_HUGO[dup_index, 1]
#Mean expression values
replacement_dup <- data.frame(t(colMeans(nonempty_HUGO[dup_index, 3:58])))
colnames(replacement_dup) <- gsub(colnames(replacement_dup), pattern="^X", replace="")
#Save associated ids
replacement_dup[,"ensembl_gene_id"] <- ""
replacement_dup[,"hgnc_symbol"] <- HUGO_duplicate_name[x]
#Remove duplicate rows
removed_dup <- nonempty_HUGO[-dup_index,]
#Add meaned duplicates back
unique_counts <- rbind(removed_dup, replacement_dup)
}
#Preview into the dataset
kable(head(unique_counts))
| ensembl_gene_id | hgnc_symbol | 10876X1 | 10876X2 | 10876X3 | 10876X4 | 10876X5 | 10876X6 | 10876X7 | 10876X8 | 10876X9 | 14406X1 | 14406X2 | 14406X3 | 14406X4 | 14406X5 | 14406X6 | 14406X7 | 14406X8 | 14406X9 | 10876X10 | 10876X11 | 10876X12 | 10876X13 | 10876X14 | 10876X15 | 10876X16 | 10876X17 | 10876X18 | 10876X19 | 10876X20 | 10876X21 | 10876X22 | 10876X23 | 10876X24 | 10876X25 | 10876X26 | 10876X27 | 10876X28 | 10876X29 | 10876X30 | 10876X31 | 10876X32 | 10876X33 | 10876X34 | 10876X35 | 10876X36 | 14406X10 | 14406X11 | 14406X12 | 14406X13 | 14406X14 | 14406X15 | 14406X16 | 14406X17 | 14406X18 | 14406X19 | 14406X20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 5.289463 | 4.025953 | 5.7483737 | 3.3755463 | 4.052595 | 4.3699835 | 4.722525 | 4.079050 | 5.3109728 | 4.517023 | 3.174709 | 3.927295 | 2.8252135 | 3.221141 | 3.513752 | 2.258656 | 3.2873595 | 4.672469 | 3.570428 | 2.206914 | 1.638454 | 4.3521590 | 4.7169699 | 5.1087359 | 6.016743 | 5.454376 | 4.884660 | 4.648841 | 3.708616 | 3.8897329 | 3.526400 | 4.267296 | 2.9441147 | 3.4847624 | 3.779075 | 3.302813 | 3.705487 | 3.387705 | 4.479046 | 3.2853820 | 3.3742099 | 3.061681 | 2.522700 | 1.747284 | 3.236915 | 3.241200 | 1.763651 | 3.387475 | 4.070012 | 4.3549460 | 3.505752 | 3.132316 | 3.687560 | 2.810795 | 3.641845 | 3.529361 |
| ENSG00000000419 | DPM1 | 22.345283 | 27.327681 | 26.6003176 | 13.7722291 | 22.364322 | 17.3758868 | 18.103013 | 15.879157 | 23.3334541 | 23.731228 | 18.550263 | 22.379349 | 16.6053366 | 15.177579 | 30.452516 | 15.003933 | 18.3705385 | 21.697441 | 19.583254 | 18.564045 | 7.373045 | 23.6808651 | 23.4607189 | 23.5001852 | 16.433791 | 22.063195 | 20.799196 | 13.074864 | 9.949945 | 17.9265953 | 21.429661 | 21.024238 | 16.3889053 | 18.4798004 | 22.579974 | 16.018645 | 21.187782 | 21.003774 | 19.057903 | 22.7077877 | 19.3119670 | 20.190541 | 15.035290 | 17.285628 | 15.035993 | 17.308009 | 14.329662 | 11.410443 | 23.063403 | 19.2622613 | 16.554939 | 13.794240 | 11.638861 | 10.156338 | 21.677648 | 14.295696 |
| ENSG00000000457 | SCYL3 | 14.680959 | 13.663840 | 13.3001588 | 12.6920542 | 14.559324 | 17.5839812 | 16.686256 | 13.985313 | 14.7140065 | 11.865614 | 13.694825 | 12.218252 | 12.7999469 | 13.266733 | 14.494226 | 11.615948 | 13.6651023 | 14.232232 | 15.255463 | 14.929127 | 12.054343 | 13.8245050 | 16.6335256 | 13.6756931 | 14.817353 | 13.021707 | 14.496409 | 13.801245 | 13.206291 | 13.8677435 | 13.201394 | 11.344762 | 14.2298879 | 13.5166540 | 13.604671 | 14.945230 | 11.876559 | 13.163655 | 12.383245 | 13.7213015 | 12.7789224 | 15.970387 | 14.127118 | 12.480598 | 14.148452 | 14.131633 | 12.198584 | 12.539602 | 12.492677 | 12.6740096 | 12.140288 | 11.987134 | 11.062680 | 11.955247 | 14.532695 | 12.299289 |
| ENSG00000000460 | C1orf112 | 2.482809 | 2.317973 | 2.9305435 | 2.8354589 | 2.926874 | 3.3295112 | 3.148350 | 2.622246 | 3.9179307 | 4.651860 | 4.232946 | 2.992225 | 3.4017877 | 2.511398 | 3.172137 | 4.732423 | 6.3168869 | 4.403936 | 3.245843 | 1.557822 | 2.808779 | 2.3040842 | 1.9860926 | 2.9080497 | 2.873668 | 3.587112 | 3.991765 | 4.358288 | 3.075438 | 3.7206141 | 2.531774 | 3.850974 | 3.5329377 | 5.1743441 | 5.007275 | 3.633095 | 3.325437 | 3.678080 | 3.161680 | 4.2516709 | 3.4460016 | 4.220154 | 3.834503 | 3.244956 | 3.445748 | 4.278384 | 2.865932 | 3.862911 | 4.239596 | 3.6291217 | 3.635594 | 4.216580 | 4.753495 | 3.297999 | 4.162108 | 3.921513 |
| ENSG00000000938 | FGR | 2.914602 | 0.853990 | 0.7889925 | 0.6751093 | 2.176394 | 0.8323778 | 1.416758 | 1.456803 | 0.6094559 | 1.011274 | 1.493981 | 1.059746 | 0.8072039 | 2.347611 | 1.756876 | 1.451993 | 0.7734964 | 1.342663 | 2.704869 | 9.866206 | 3.862071 | 0.8960327 | 0.7447847 | 0.5501716 | 1.347032 | 1.277602 | 2.888777 | 2.179144 | 1.266357 | 0.5919159 | 1.265887 | 1.353045 | 0.8832344 | 0.9503897 | 1.133723 | 0.990844 | 3.040399 | 1.161499 | 2.019962 | 0.5797733 | 0.6461253 | 2.896184 | 1.109988 | 47.800691 | 4.020040 | 1.037184 | 2.571991 | 4.100628 | 1.130559 | 0.8933223 | 1.168584 | 1.445685 | 1.526880 | 1.161795 | 1.456738 | 1.782506 |
| ENSG00000000971 | CFH | 38.861361 | 16.713805 | 26.8257441 | 23.3587807 | 35.872973 | 34.1274902 | 35.104104 | 18.647084 | 40.6594144 | 29.866290 | 29.817369 | 27.179376 | 20.8143281 | 22.984751 | 27.085171 | 26.135883 | 33.9049236 | 35.714847 | 32.891211 | 27.651340 | 12.054343 | 27.2649960 | 17.2541795 | 28.6875171 | 25.324203 | 17.493312 | 19.433592 | 24.987518 | 23.789415 | 33.1472894 | 38.519137 | 55.787087 | 30.0299702 | 19.6413878 | 20.123575 | 47.395372 | 40.380301 | 33.489888 | 44.263516 | 42.9998532 | 54.4898995 | 45.428718 | 32.492372 | 34.072033 | 35.919317 | 25.735129 | 26.454761 | 37.797094 | 31.146900 | 26.1855088 | 20.580061 | 26.323506 | 15.384039 | 13.716679 | 18.590751 | 15.507800 |
What are the control and test conditions of the dataset?
Control condition: Vastus lateralis cells of YOUNG (N=9) and OLD (N=18) men and women before five days of bed rest
Test condition: Vastus lateralis cells of YOUNG (N=9) and OLD (N=18) men and women after five days of bed rest
Why is the dataset of interest to you?
This dataset stood out to me as there are no direct interventions aside from a change in physical behaviour. I usually do not encounter experimental studies that do not incorporate pharmaceutical/biochemical intervention, so I was interested to see the resulting gene expression values.
Were there expression values that were not unique for specific genes? How did you handle these?
Yes. These were removed.
#Number of duplicates
length(setdiff(nonempty_HUGO[duplicated(nonempty_HUGO[,2]), 2], unique_counts))
## [1] 1
Were there expression values that could not be mapped to current HUGO symbols?
Yes. These were removed from the final dataset. The function call to getBM excluded depreciated ids and some of those with no HUGO symbol. There were some empty HUGO symbol strings returned, these were also removed from the final dataset.
#Number of values that could not be mapped to HUGO symbols
length(setdiff(rownames(normalized_counts),nonempty_HUGO[,1]))
## [1] 1704
How many outliers were removed? Low count outliers were removed and assigned to this variable. No other outliers were removed.
#Low count outliers
dim(expCounts_filtered)
## [1] 16674 57
How did you handle replicates? I averaged duplicate HUGO genes and added them to the dataset after removing the old genes.Aside from that, there were no replicate rows in the dataset after filtering, normalization, HUGO mapping and removing duplicate HUGO genes. Otherwise, I would remove them.
table(duplicated(nonempty_HUGO))
##
## FALSE
## 14970
What is the final coverage of your dataset?
#Starting coverage from given expression data
dim(expCounts)
## [1] 58051 57
#Coverage following filtering for low counts
dim(expCounts_filtered)
## [1] 16674 57
#Coverage after mapping to HUGO symbols
dim(normalized_counts_annot)
## [1] 16578 58
#Coverage after filtering for nonempty and unique symbols
dim(unique_counts)
## [1] 14969 58
As we can see from the above code snippets, the initial dataset contained 58051 gene ids, however our final dataset, following all the steps of processing, contains 14969 gene ids.
[1] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy. J Appl Physiol (1985) 2019 Apr 1;126(4):894-902. PMID: 30605403
[2] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Disuse-induced insulin resistance susceptibility coincides with a dysregulated skeletal muscle metabolic transcriptome. J Appl Physiol (1985) 2019 May 1;126(5):1419-1429. PMID: 30763167
[3] Isserlin R, BCB420-lectures-public, (2020), GitHub repository, https://github.com/risserlin/BCB420-lectures-public
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
library(GEOquery)
library(biomaRt)
library(edgeR)
library(knitr)
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
if (!requireNamespace("dplyr", quietly = TRUE))
install.packages("dplyr")
if (!requireNamespace("gprofiler2", quietly = TRUE))
install.packages("gprofiler2")
library(ComplexHeatmap)
library(circlize)
library(dplyr)
library(gprofiler2)
We begin with a bit of adjustments to data and matrices that we will use for the analysis. From Assignment 1, I need the sample data to hold the different patient parameters in each column rather than rows. We also adjust the column names in the count data to the descriptive titles so we can access and thus compare the conditions more easily. This is all followed by the creation of matrices for furthur functions.
#Adjust sample set to desired format
samples <- data.frame(GSE113165_data$description.1, GSE113165_data$`subject:ch1`, GSE113165_data$`age:ch1`, GSE113165_data$`Sex:ch1`, gsub("susceptibility: ", "", GSE113165_data$characteristics_ch1.5), gsub("time: ", "", GSE113165_data$characteristics_ch1.1))
colnames(samples) <- c("id", "subject","age", "sex", "susceptibility", "time")
#Adding a new attribute to create specific types based on ANOVA
samples[,"type"] <- paste(samples[,3], lapply(strsplit(as.character(samples[,6]), " "), `[[`, 1), sep="_")
#Changing colnames from patient-sample ids to the descriptive title
match_index <- match(colnames(unique_counts[3:58]), GSE113165_data[,"description.1"])
colnames(unique_counts[3:58]) <- GSE113165_data[match_index,"title"]
unique_counts <- unique_counts[which(!is.na(unique_counts[,2])),]
#Data matrix for expression
expressionMatrix <- as.matrix(unique_counts[,3:58])
rownames(expressionMatrix) <- unique_counts[,2]
colnames(expressionMatrix) <- colnames(unique_counts)[3:58]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
#Create numerical matrix to produce a heatmap plot
heatmap_matrix <- unique_counts[,3:ncol(unique_counts)]
rownames(heatmap_matrix) <- unique_counts$ensembl_gene_id
colnames(heatmap_matrix) <- colnames(unique_counts[,3:ncol(unique_counts)])
#Row Normalization; scaling each row around the mean
heatmap_matrix <- t(scale(t(heatmap_matrix)))
#Changing colnames from patient-sample ids to the descriptive title
match_index <- match(colnames(heatmap_matrix), GSE113165_data[,"description.1"])
colnames(heatmap_matrix) <- GSE113165_data[match_index,"title"]
Now we begin the differential expression
#Creating a design matrix based on the condition that is being tested and accounting for subject variability (i.e. Pre and post bed rest)
model_ANOVA <- model.matrix(~ samples$type)
#Fit to model_design
fit_ANOVA <- lmFit(minimalSet, model_ANOVA)
#Computing differential expression for the model
#trend is TRUE as this is a RNAseq dataset
fit2_ANOVA <- eBayes(fit_ANOVA,trend=TRUE)
#Adjust fix based on Benjamin-Hochberg multiple hypothesis method
topfit_ANOVA <- topTable(fit2_ANOVA,
coef=c(2:ncol(model_ANOVA)),
adjust.method = "BH",
number = nrow(expressionMatrix))
#Merge hgnc names to topfit table
output_hits_ANOVA <- merge(unique_counts[,1:2],
topfit_ANOVA,
by.y=0,by.x=2,
all.y=TRUE)
#Sort by P-value
output_hits_ANOVA <- output_hits_ANOVA[order(output_hits_ANOVA$P.Value),]
1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?
The P-values are generated by limma::eBayes().
(head(output_hits_ANOVA[,c("hgnc_symbol","ensembl_gene_id","P.Value")]))
Based on the code below, we can see that 2241 genes are significantly differentially expressed and within our threshold (< 0.05). We use the general rule that a p-value of greater than 0.05 represents the probability that the null hypothesis is true, therefore a p-value of less than 0.05 is statistically significant.
#How many gene pass the threshold p-value < 0.05?
length(which(output_hits_ANOVA$P.Value < 0.05))
## [1] 2241
2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?
Based on the default adjust.method parameter for edgeR::topTags(),the Benjamin-Hochberg method is used for multiple correction. This method is used over the others as a default and for this analysis, as it represents the false discovery rate amoungst the differentially expressed genes, which is less “stringent” than familywise error rate and thus more powerful.
From the code below, we can see that 163 gene pass correction.
#How many genes pass correction?
length(which(output_hits_ANOVA$adj.P.Val < 0.05))
## [1] 163
3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
Here we have the volcano plot representing the differentially expressed genes within our dataset. The x-axis bound represents a p-value of 0.05.
volcanoplot(fit2_ANOVA, coef=c(2:ncol(model_ANOVA)), style = "p-value",
xlab = "log2 Fold Change", ylab = NULL, pch=16, cex=0.35, xlim=c(-5, 5), ylim=c(0, 6), highlight = 10, hl.col = "blue", names=rownames(fit_ANOVA))
abline(h=1.3,v=0, lty=2, col="red")
The top differentially expressed genes are as follows (without taking into account valid p-value and adjusted p-value),
#Based on topTable
head(topfit_ANOVA, n=10)
#Based on P-value
head(output_hits_ANOVA, n=10)
4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.
There is subtle clustering in this heatmap, but it’s not very clear.
top_hits <- output_hits_ANOVA$ensembl_gene_id[output_hits_ANOVA$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[
which(rownames(heatmap_matrix) %in% top_hits),])))
#Grouping samples based on time condition (i.e. pre or post 5 days of bed rest) and susceptability to disuse-induced insulin resistance (i.e. low, high, if available)
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "pre\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "pre\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "pre$"),
grep(colnames(heatmap_matrix_tophits),pattern = "post\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "post\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "post$"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
#Create subtype annotation vector based on matrix columns for a top annotation bar
typeAnno <- c()
susAnno <- c()
for (x in 1:ncol(heatmap_matrix_tophits)){
typeAnno <- append(typeAnno, samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 7])
susAnno <- append(susAnno, as.character(samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 5]))
}
n <- HeatmapAnnotation(type = typeAnno, susceptibility = susAnno)
#Create and display heatmap
time_sus_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = FALSE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
top_annotation = n
)
time_sus_heatmap
I notice that if the patients are also sorted/grouped by age, there does seem to be subtle clustering of expression within the age groups (i.e. YOUNG, OLD) and time conditions (i.e. pre bed rest and post 5 days of bed rest). This clustering is supported by statements in the paper that aging corresponds to “biochemical, structural, and functional alterations in skeletal muscle” which therefore would reflect when comparing the expression of young and old patients samples. The clustering of the two time conditions are more prevalent in the old group, which is also supported by the paper.
#Rename samples with descriptive labels for sorting (seen below)
colnames(heatmap_matrix_tophits) <- GSE113165_data[match_index,"title"]
#Grouping samples based on time condition (i.e. pre or post 5 days of bed rest), susceptability to disuse-induced insulin resistance (i.e. low, high, if available) and age.
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "young.*pre\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "young.*pre\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "young.*post\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "young.*post\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*pre$"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*post\\_low"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*post\\_high"),
grep(colnames(heatmap_matrix_tophits),pattern = "old.*post$"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
#Create subtype annotation vector based on matrix columns for a top annotation bar
typeAnno <- c()
susAnno <- c()
for (x in 1:ncol(heatmap_matrix_tophits)){
typeAnno <- append(typeAnno, samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 7])
susAnno <- append(susAnno, as.character(samples[which(samples$id == lapply(strsplit(colnames(heatmap_matrix_tophits)[x], "_"), `[[`, 1)), 5]))
}
n <- HeatmapAnnotation(type = typeAnno, susceptibility = susAnno)
#Create and display heatmap
time_age_sus_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = FALSE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
top_annotation = n
)
time_age_sus_heatmap
#Read in supplementary files
sfiles = getGEOSuppFiles('GSE70072')
fnames = rownames(sfiles)
my_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE)
#Merge gene names with the top hits
output_hits_ANOVA_withgn <- ""
output_hits_ANOVA_withgn <- merge(my_exp[,1],output_hits_ANOVA, by.x=1, by.y = 2)
output_hits_ANOVA_withgn[,c("old_pre_rank", "young_post_rank", "young_pre_rank")] <- -log(output_hits_ANOVA_withgn$P.Value,base =10) * sign(output_hits_ANOVA_withgn[,c("samples.typeold_pre", "samples.typeyoung_post", "samples.typeyoung_pre")])
#Hits based on logFC values for each grouping/type
output_hits_old_pre_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$old_pre_rank),]
output_hits_young_post_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$young_post_rank),]
output_hits_young_pre_rank <- output_hits_ANOVA_withgn[order(output_hits_ANOVA_withgn$young_pre_rank),]
#Type-"old_pre" up/down regulated genes
old_pre_upregulated_genes <- output_hits_old_pre_rank$hgnc_symbol[
which(output_hits_old_pre_rank$P.Value < 0.05
& output_hits_old_pre_rank$samples.typeold_pre > 0)]
old_pre_downregulated_genes <- output_hits_old_pre_rank$hgnc_symbol[
which(output_hits_old_pre_rank$P.Value < 0.05
& output_hits_old_pre_rank$samples.typeold_pre < 0)]
#Type-"young_post" up/down regulated genes
young_post_upregulated_genes <- output_hits_young_post_rank$hgnc_symbol[
which(output_hits_young_post_rank$P.Value < 0.05
& output_hits_young_post_rank$samples.typeyoung_post > 0)]
young_post_downregulated_genes <- output_hits_young_post_rank$hgnc_symbol[
which(output_hits_young_post_rank$P.Value < 0.05
& output_hits_young_post_rank$samples.typeyoung_post < 0)]
#Type-"young_pre" up/down regulated genes
young_pre_upregulated_genes <- output_hits_young_pre_rank$hgnc_symbol[
which(output_hits_young_pre_rank$P.Value < 0.05
& output_hits_young_pre_rank$samples.typeyoung_pre > 0)]
young_pre_downregulated_genes <- output_hits_young_pre_rank$hgnc_symbol[
which(output_hits_young_pre_rank$P.Value < 0.05
& output_hits_young_pre_rank$samples.typeyoung_pre < 0)]
1. Which method did you choose and why?
There was a pretty substantial difference between the limma and quasi methods, so I chose the limma results as it seemed to align best with the results of the paper and more refined (less genes).
2. What annotation data did you use and why? What version of the annotation are you using?
I decided to use g:Profiler, in particular the gprofiler2 package, as it is familiar to me and also provides a wide variety of sources to get a good overview and grasp of the processess effected by the up and down regulated genes. “The package corresponds to the 2019 update of g:Profiler and provides access for versions e94_eg41_p11 and higher.”
3. How many genesets were returned with what thresholds?
Number of genes in rank list for old_pre…
nrow(output_hits_old_pre_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_old_pre_rank)
Number of genes in rank list for young_post…
#Number of genes in rank list for young_post
nrow(output_hits_young_post_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_young_post_rank)
Number of genes in rank list for young_pre…
#Number of genes in rank list for young_pre
nrow(output_hits_young_pre_rank)
## [1] 14820
#Preview of table/thresholds
head(output_hits_young_pre_rank)
4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list?
I only analyzed the young post-bed rest samples for simplicity’s sake.
#g:Profiler queries
#Multi-query: young_post
gostres2 <- gost(query = c(young_post_upregulated_genes,young_post_downregulated_genes), organism = "hsapiens", sources = c("HPA","HP"))
gostres2up <- gost(query = young_post_upregulated_genes, organism = "hsapiens", sources = c("HPA","HP"))
gostres2down <- gost(query = young_post_downregulated_genes, organism = "hsapiens", sources = c("HPA", "HP"))
#Print tables
pt2 <- publish_gosttable(head(gostres2, n = 10),
highlight_terms = gostres2$result[c(1:20),],
use_colors = TRUE,
show_columns = c("term_id", "term_name"),
filename = NULL)
pt2up <- publish_gosttable(head(gostres2up, n = 10),
highlight_terms = gostres2up$result[c(1:20),],
use_colors = TRUE,
show_columns = c("term_id", "term_name"),
filename = NULL)
pt2down <- publish_gosttable(head(gostres2down, n = 10),
highlight_terms = gostres2down$result[c(1:20),],
use_colors = TRUE,
show_columns = c("term_id", "term_name"),
filename = NULL)
As shown, the first table is our multi-query and the following two are the up and down regulated sets queried individually. There is crossover between the multiquery and upregulated-query result sets, however the terms provided in the down regulated-query result is significantly shorter and less relevant.
1. Do the over-representation results support conclusions or mechanism discussed in the original paper?
Based on the descriptions given by the g:profiler query, it seems that it results do support conclusions discussed in the original paper. Many of the genes are involved in processes that focus on estrogen and female reproductive organs, which also appear in the paper. Not all of the genes were queried above, however the downregulated genes for the young post-bed rest samples did not produce accurate results - this is likely due to my source specification.
2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.
In a 2018 study by Dickinson JM et. al. in which the transcriptome of leg skeletal muscle before and after divergent exercise stimuli was observed, the results highlight a notable “increase in estrogen-related receptor-γ”. However it is difficult to find more papers to support this finding. It is likely the presence of other highlighted genes in the paper that supports the change in the transcriptome of vastus lateralis cells.
[1] Davis S, Meltzer P (2007). “GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics, 14, 1846–1847.
[2] Durinck S, Spellman P, Birney E, Huber W (2009). “Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.” Nature Protocols, 4, 1184–1191.
[3] Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W (2005). “BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis.” Bioinformatics, 21, 3439–3440.
[4] Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.
[5] McCarthy DJ, Chen Y, Smyth GK (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297. doi: 10.1093/nar/gks042.
[6] Yihui Xie (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.27.
[7] Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC. ISBN 978-1498716963
[8] Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research.Chapman and Hall/CRC. ISBN 978-1466561595
[9] Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
[10] Gu, Z. (2014) circlize implements and enhances circular visualization in R. Bioinformatics.
[11] Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2020). dplyr: A Grammar of Data Manipulation. R package version 0.8.4. https://CRAN.R-project.org/package=dplyr
[12] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Age-dependent skeletal muscle transcriptome response to bed rest-induced atrophy. J Appl Physiol (1985) 2019 Apr 1;126(4):894-902. PMID: 30605403
[13] Mahmassani ZS, Reidy PT, McKenzie AI, Stubben C et al. Disuse-induced insulin resistance susceptibility coincides with a dysregulated skeletal muscle metabolic transcriptome. J Appl Physiol (1985) 2019 May 1;126(5):1419-1429. PMID: 30763167
[14] Isserlin R, BCB420-lectures-public, (2020), GitHub repository, https://github.com/risserlin/BCB420-lectures-public
[15] https://support.bioconductor.org/p/12441/
[16] https://support.bioconductor.org/p/18967/
[17] Liis Kolberg and Uku Raudvere (2019). gprofiler2: Interface to the ‘g:Profiler’ Toolset. R package version 0.1.8. https://CRAN.R-project.org/package=gprofiler2
[18] Dickinson JM, D’Lugos AC, Naymik MA, Siniard AL et al. Transcriptome response of human skeletal muscle to divergent exercise stimuli. J Appl Physiol (1985) 2018 Jun 1;124(6):1529-1540. PMID: 29543133